Turkish electricity consumption is an important subject to model, since supply must closely track consumption demand. Careful and meticulous modelling and accurate estimation are therefore required; this project presents one way to perform such modelling and reports its results.
library(forecast)
library(tseries)
library(ggplot2)
library(zoo)
library(xts)
library(dplyr)
library(tidyr)
library(MLmetrics)
library(urca)
library(astsa)
library(readr)
library(data.table)
Data is obtained from the EPIAS - EXIST Transparency Platform. The following chunk imports it and performs some light manipulation.
I will aggregate the data into a daily series, since estimating daily values is preferable, as mentioned in the lectures. The approach can easily be extended to hourly data.
# consumption indices and data
getwd()
bulk_consumption_with_temp <- read.csv("IEdata.csv")
names(bulk_consumption_with_temp) <- c("Date", "Hour", "Consumption")
bulk_consumption_with_temp <- as.data.table(bulk_consumption_with_temp)
bulk_consumption <- bulk_consumption_with_temp %>%
  group_by(Date) %>%
  summarise(Consumption = sum(Consumption))
bulk_consumption$Date <- as.Date(bulk_consumption$Date)
head(bulk_consumption)
The feature dataset is taken from our group’s project feature dataset. It includes features thought to influence consumption. Since that dataset runs slightly longer than the consumption data, only the first 1484 rows are kept to match its length.
# feature indices and data
all_features_df <- read.csv("all_features_with_dummies.csv")
# head(all_features_df)
# tail(all_features_df)
dt_daily <- head(all_features_df, 1484)
# tail(dt_daily)
dt_daily <- as.data.table(dt_daily)
tail(dt_daily)
## date covid_severity sunlight_time_minutes production_capacity_rate
## 1: 2021-01-18 49.48276 579 75.5
## 2: 2021-01-19 70.68966 580 75.5
## 3: 2021-01-20 51.60345 582 75.5
## 4: 2021-01-21 40.29310 584 75.5
## 5: 2021-01-22 45.09550 586 75.5
## 6: 2021-01-23 43.87804 588 75.5
## price_of_electricity laborforce export monday tuesday wednesday thursday
## 1: 551.5 50.90053 14486.97 1 0 0 0
## 2: 551.5 50.90053 14486.97 0 1 0 0
## 3: 551.5 50.90053 14486.97 0 0 1 0
## 4: 551.5 50.90053 14486.97 0 0 0 1
## 5: 551.5 50.90053 14486.97 0 0 0 0
## 6: 551.5 50.90053 14486.97 0 0 0 0
## friday saturday sunday new_year nat_holiday sacrifice_holiday sacrifice_eve
## 1: 0 0 0 0 0 0 0
## 2: 0 0 0 0 0 0 0
## 3: 0 0 0 0 0 0 0
## 4: 0 0 0 0 0 0 0
## 5: 1 0 0 0 0 0 0
## 6: 0 1 0 0 0 0 0
## ramadan_holiday ramadan_eve monday_or_friday_between_holidays extra_holidays
## 1: 0 0 0 0
## 2: 0 0 0 0
## 3: 0 0 0 0
## 4: 0 0 0 0
## 5: 0 0 0 0
## 6: 0 0 0 0
## partial_lockdown full_lockdown
## 1: 1 0
## 2: 1 0
## 3: 1 0
## 4: 1 0
## 5: 1 0
## 6: 0 1
Next, the data is manipulated and divided into train and test sets. The training data runs through 2021-01-09. The test data covers 2021-01-10 to 2021-01-23; it is available but will be used only to measure performance.
# test train prediction determination
test_dt_daily <- dt_daily[1471:nrow(dt_daily), ]
dt_daily <- dt_daily[1:1470,]
guess_ahead <- nrow(test_dt_daily)
train_consumption_daily <- bulk_consumption$Consumption[1:1470]
test_consumption_daily <- tail(bulk_consumption$Consumption, (length(bulk_consumption$Consumption)-nrow(dt_daily)))
Next, we convert the dataset into a time series with weekly frequency: as mentioned in the lectures, electricity consumption follows a weekly seasonal pattern, so the frequency is set to 7.
Decomposition is performed here, along with a residual analysis that assesses the stationarity of the residuals through a unit-root (KPSS) test. Note that the KPSS null hypothesis is stationarity, so a test statistic below the critical values supports stationarity.
Autocorrelation and partial autocorrelation plots are then given.
# time series conversion
ts_train_consumption_daily <- ts(train_consumption_daily, frequency = 7)
ts_test_consumption_daily <- ts(test_consumption_daily, frequency = 7)
decomposed_ts_train_consumption_daily <- decompose(ts_train_consumption_daily)
plot(decomposed_ts_train_consumption_daily)
decomp_ts_random <- decomposed_ts_train_consumption_daily$random
summary(ur.kpss(decomp_ts_random))
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 7 lags.
##
## Value of test-statistic is: 0.0028
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The weekly seasonality pattern is clear from the seasonal component of the decomposition. The random part, i.e. the residuals, appears roughly stationary, but some outliers need to be corrected.
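As a side check (not part of the original pipeline), the strength of the weekly seasonality can also be quantified from the decomposition; values near 1 indicate a dominant seasonal component. The series below is synthetic and only illustrates the computation.

```r
# Seasonal strength F_s = max(0, 1 - var(remainder) / var(seasonal + remainder));
# illustrated on a synthetic daily series with a clear weekly pattern.
set.seed(1)
weekly_pattern <- c(95, 100, 102, 104, 103, 90, 85)
x <- ts(rep(weekly_pattern, 52) + rnorm(7 * 52, sd = 2), frequency = 7)
d <- decompose(x)
ok  <- !is.na(d$random)  # decompose() leaves NAs at the series edges
f_s <- max(0, 1 - var(d$random[ok]) / var(d$seasonal[ok] + d$random[ok]))
f_s  # close to 1 for a strongly seasonal series
```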
par(mfrow = c(3,1))
plot(decomp_ts_random)
acf(decomp_ts_random, na.action = na.pass)
pacf(decomp_ts_random, na.action = na.pass)
par(mfrow = c(1,1))
The autocorrelations and partial autocorrelations are shown above. Some values imply positive correlation, so we will first apply a log transform to check for non-constant variance. Afterwards we will try weekly differencing, since the plots hint at mild lag-7 autocorrelation.
# logging the data
logged_ts_train_consumption_daily <- log(ts_train_consumption_daily)
logged_ts_test_consumption_daily <- log(ts_test_consumption_daily)
decomposed_logged_ts_train_consumption_daily <- decompose(logged_ts_train_consumption_daily)
plot(logged_ts_train_consumption_daily)
decomp_logged_ts_random <- decomposed_logged_ts_train_consumption_daily$random
summary(ur.kpss(decomp_logged_ts_random))
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 7 lags.
##
## Value of test-statistic is: 0.0028
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
par(mfrow = c(3,1))
plot(decomp_logged_ts_random)
acf(decomp_logged_ts_random, na.action = na.pass)
pacf(decomp_logged_ts_random, na.action = na.pass)
par(mfrow = c(1,1))
Since the log transform improves neither the residuals nor stationarity, logging is deemed ineffective.
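A softer alternative to a fixed log transform is to estimate a Box-Cox parameter with forecast::BoxCox.lambda; a lambda near 1 likewise suggests no variance-stabilizing transform is needed. A sketch on synthetic data (this check was not part of the original analysis):

```r
library(forecast)

# Estimate the Box-Cox lambda: lambda near 1 means "leave the data alone",
# lambda near 0 means a log transform would be appropriate.
set.seed(3)
z <- ts(rep(c(95, 100, 102, 104, 103, 90, 85), 52) + rnorm(7 * 52, sd = 2),
        frequency = 7)
lambda <- BoxCox.lambda(z)
lambda
```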
# differencing with lag7
differenced_ts_train_consumption_daily <- ts(diff(ts_train_consumption_daily, 7), frequency = 7)
decomposed_differenced_ts_train_consumption_daily <- decompose(differenced_ts_train_consumption_daily)
decomp_diffed_ts_random <- decomposed_differenced_ts_train_consumption_daily$random
summary(ur.kpss(decomp_diffed_ts_random))
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 7 lags.
##
## Value of test-statistic is: 0.004
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
Differencing does not seem effective either: the KPSS test statistic is slightly higher (0.004 vs. 0.0028). Therefore differencing will not be used in the optimal model.
Instead, we will use the base values and correct them with outlier operations.
First, we will get the indices of error outliers.
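To make the mechanics concrete, here is a minimal sketch (on a synthetic series, not the consumption data) of how forecast::tsoutliers reports flagged indices together with smoothed replacement values:

```r
library(forecast)

set.seed(7)
y <- ts(rnorm(100, mean = 50, sd = 1), frequency = 7)
y[40] <- 500  # inject an obvious spike
out <- tsoutliers(y)
out$index                         # flagged positions, including 40
out$replacements                  # smoothed values to substitute
y[out$index] <- out$replacements  # same style of correction as applied below
```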
tsoutliers(decomp_ts_random)$index
## [1] 121 139 174 177 178 180 241 243 244 245 247 248 366 478 486
## [16] 528 529 531 532 595 596 598 599 600 603 604 607 667 731 843
## [31] 851 882 883 885 886 887 890 926 951 954 955 956 957 972 1032
## [46] 1093 1096 1197 1209 1217 1218 1235 1237 1238 1241 1242 1244 1292 1305 1306
## [61] 1307 1308 1309 1311 1398 1462 1463
As inspection shows, New Year days are consistently among the outliers. Next, we extract their indices in order to correct those values.
outlier_indexes <- tsoutliers(decomp_ts_random)$index
outlier_values <- tsoutliers(decomp_ts_random)$replacements
newyear_indices <- c(1,outlier_indexes[c(15,33,51,76)])
newyear_indices
## [1] 1 486 883 1218 NA
Results improve slightly with these replacements. (Note the trailing NA above: position 76 exceeds the length of the outlier index vector.)
All of these outliers are now smoothed using the replacement values estimated by tsoutliers:
ts_train_consumption_daily[outlier_indexes] <- outlier_values
summary(ur.kpss(decompose(ts_train_consumption_daily)$random))
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 7 lags.
##
## Value of test-statistic is: 0.0027
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
plot(decompose(ts_train_consumption_daily)$random)
Again, stationarity is only slightly better.
Next, the remaining outliers are identified from the data itself rather than from the residuals. They are then replaced, again with the replacement values estimated by the tsoutliers function.
still_outliers <- tsoutliers(ts_train_consumption_daily)
ts_train_consumption_daily[still_outliers$index] <- still_outliers$replacements
summary(ur.kpss(decompose(ts_train_consumption_daily)$random))
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 7 lags.
##
## Value of test-statistic is: 0.0029
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
plot(decompose(ts_train_consumption_daily)$random)
# performance metrics function
perf_dt <- function(actual, forecast) {
  n     <- length(actual)
  error <- actual - forecast
  mean  <- mean(actual)
  sd    <- sd(actual)
  FBias <- sum(error) / sum(actual)
  MPE   <- sum(error / actual) / n       # computed but not reported below
  MAPE  <- sum(abs(error / actual)) / n
  RMSE  <- sqrt(sum(error^2)) / n        # note: sqrt(SSE)/n, not the usual sqrt(SSE/n)
  MAD   <- sum(abs(error)) / n
  WMAPE <- MAD / mean
  data.frame(n, mean, sd, FBias, MAPE, RMSE, MAD, WMAPE)
}
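As a self-contained sanity check of the definitions above (toy numbers, not from the dataset): WMAPE computed as MAD / mean(actual) is identical to sum(|error|) / sum(actual), which is why it is less sensitive than MAPE to small actual values.

```r
# Toy check that WMAPE = MAD / mean(actual) equals sum(|error|) / sum(actual)
actual <- c(100, 120, 80, 110)
fcst   <- c(95, 125, 70, 115)
error  <- actual - fcst
MAD    <- sum(abs(error)) / length(actual)
WMAPE  <- MAD / mean(actual)
all.equal(WMAPE, sum(abs(error)) / sum(actual))  # TRUE
```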
xregr <- cbind(dt_daily$covid_severity,
dt_daily$sunlight_time_minutes,
dt_daily$production_capacity_rate,
dt_daily$price_of_electricity,
dt_daily$new_year,
dt_daily$nat_holiday,
dt_daily$sacrifice_holiday,
dt_daily$sacrifice_eve,
dt_daily$ramadan_holiday,
dt_daily$ramadan_eve,
dt_daily$monday_or_friday_between_holidays,
dt_daily$extra_holidays,
dt_daily$full_lockdown)
newxregr <- cbind(test_dt_daily$covid_severity,
test_dt_daily$sunlight_time_minutes,
test_dt_daily$production_capacity_rate,
test_dt_daily$price_of_electricity,
test_dt_daily$new_year,
test_dt_daily$nat_holiday,
test_dt_daily$sacrifice_holiday,
test_dt_daily$sacrifice_eve,
test_dt_daily$ramadan_holiday,
test_dt_daily$ramadan_eve,
test_dt_daily$monday_or_friday_between_holidays,
test_dt_daily$extra_holidays,
test_dt_daily$full_lockdown)
mmmf <- sarima.for(ts_train_consumption_daily, 1,0,0,0,0,0,7, n.ahead = guess_ahead, xreg = xregr, newxreg = newxregr)
perf_dt(actual = test_consumption_daily, forecast = mmmf$pred)
## n mean sd FBias MAPE RMSE MAD WMAPE
## 1 14 890563.3 60458.13 0.06860152 0.07487503 20338.75 67918.91 0.07626511
mmmf <- sarima.for(ts_train_consumption_daily, 0,0,0,1,0,0,7, n.ahead = guess_ahead, xreg = xregr, newxreg = newxregr)
perf_dt(actual = test_consumption_daily, forecast = mmmf$pred)
## n mean sd FBias MAPE RMSE MAD WMAPE
## 1 14 890563.3 60458.13 0.05072508 0.04995179 13896.3 45173.89 0.05072508
Since these models were specified manually, we will use auto.arima to search for better parameter values.
# auto.arima parameter finding
auto.arima(ts_train_consumption_daily, xreg = xregr, seasonal = T, trace = T)
##
## Fitting models using approximations to speed things up...
##
## Regression with ARIMA(2,0,2)(1,1,1)[7] errors : Inf
## Regression with ARIMA(0,0,0)(0,1,0)[7] errors : 34432.15
## Regression with ARIMA(1,0,0)(1,1,0)[7] errors : 32365.27
## Regression with ARIMA(0,0,1)(0,1,1)[7] errors : 33453.6
## ARIMA(0,0,0)(0,1,0)[7] : 34430.27
## Regression with ARIMA(1,0,0)(0,1,0)[7] errors : 32780.48
## Regression with ARIMA(1,0,0)(2,1,0)[7] errors : 32219.76
## Regression with ARIMA(1,0,0)(2,1,1)[7] errors : Inf
## Regression with ARIMA(1,0,0)(1,1,1)[7] errors : Inf
## Regression with ARIMA(0,0,0)(2,1,0)[7] errors : 34425.36
## Regression with ARIMA(2,0,0)(2,1,0)[7] errors : 32120.81
## Regression with ARIMA(2,0,0)(1,1,0)[7] errors : 32265.15
## Regression with ARIMA(2,0,0)(2,1,1)[7] errors : Inf
## Regression with ARIMA(2,0,0)(1,1,1)[7] errors : Inf
## Regression with ARIMA(3,0,0)(2,1,0)[7] errors : 32123.28
## Regression with ARIMA(2,0,1)(2,1,0)[7] errors : 32122.44
## Regression with ARIMA(1,0,1)(2,1,0)[7] errors : 32134.28
## Regression with ARIMA(3,0,1)(2,1,0)[7] errors : 32125.31
## ARIMA(2,0,0)(2,1,0)[7] : 32118.77
## ARIMA(2,0,0)(1,1,0)[7] : 32263.11
## ARIMA(2,0,0)(2,1,1)[7] : Inf
## ARIMA(2,0,0)(1,1,1)[7] : Inf
## ARIMA(1,0,0)(2,1,0)[7] : 32217.72
## ARIMA(3,0,0)(2,1,0)[7] : 32121.24
## ARIMA(2,0,1)(2,1,0)[7] : 32120.4
## ARIMA(1,0,1)(2,1,0)[7] : 32132.23
## ARIMA(3,0,1)(2,1,0)[7] : 32123.27
##
## Now re-fitting the best model(s) without approximations...
##
## ARIMA(2,0,0)(2,1,0)[7] : 32271.43
##
## Best model: Regression with ARIMA(2,0,0)(2,1,0)[7] errors
## Series: ts_train_consumption_daily
## Regression with ARIMA(2,0,0)(2,1,0)[7] errors
##
## Coefficients:
## ar1 ar2 sar1 sar2 xreg1 xreg2 xreg3 xreg4
## 1.2126 -0.2956 -0.6608 -0.3182 112.2152 -45.5971 608.0319 -83.3959
## s.e. 0.0280 0.0282 0.0258 0.0254 130.4411 152.9512 808.4956 150.3839
## xreg5 xreg6 xreg7 xreg8 xreg9 xreg10
## 948.939 -7626.194 -16866.05 -17913.766 -34941.865 -23690.760
## s.e. 4103.752 1750.308 6009.37 5290.854 5732.501 5197.141
## xreg11 xreg12 xreg13
## -28329.335 -4109.337 -8278.056
## s.e. 2883.339 2169.225 3451.554
##
## sigma^2 estimated as 218805653: log likelihood=-16117.48
## AIC=32270.95 AICc=32271.43 BIC=32366.14
The parameters found will now be used in the seasonal ARIMA model.
# sarima model
mmm <- sarima(ts_train_consumption_daily, 2,0,0,2,1,0,7, xreg = xregr)
## initial value 10.392477
## iter 2 value 10.208316
## iter 3 value 9.896441
## iter 4 value 9.806952
## iter 5 value 9.768203
## iter 6 value 9.744714
## iter 7 value 9.711160
## iter 8 value 9.675620
## iter 9 value 9.646956
## iter 10 value 9.617706
## iter 11 value 9.607328
## iter 12 value 9.601161
## iter 13 value 9.599885
## iter 14 value 9.598440
## iter 15 value 9.597334
## iter 16 value 9.596791
## iter 17 value 9.596054
## iter 18 value 9.595859
## iter 19 value 9.595760
## iter 20 value 9.595696
## iter 21 value 9.595623
## iter 22 value 9.595545
## iter 23 value 9.595454
## iter 24 value 9.595376
## iter 25 value 9.595336
## iter 26 value 9.595326
## iter 27 value 9.595324
## iter 28 value 9.595324
## iter 29 value 9.595324
## iter 30 value 9.595323
## iter 31 value 9.595323
## iter 31 value 9.595323
## iter 31 value 9.595323
## final value 9.595323
## converged
## initial value 9.597938
## iter 2 value 9.597831
## iter 3 value 9.597798
## iter 4 value 9.597796
## iter 5 value 9.597794
## iter 6 value 9.597793
## iter 7 value 9.597793
## iter 8 value 9.597792
## iter 9 value 9.597792
## iter 9 value 9.597792
## iter 9 value 9.597792
## final value 9.597792
## converged
mmm$ttable
## Estimate SE t.value p.value
## ar1 1.2126 0.0280 43.2345 0.0000
## ar2 -0.2956 0.0282 -10.4747 0.0000
## sar1 -0.6608 0.0258 -25.5750 0.0000
## sar2 -0.3182 0.0254 -12.5093 0.0000
## xreg1 112.2152 130.4411 0.8603 0.3898
## xreg2 -45.5971 152.9512 -0.2981 0.7657
## xreg3 608.0319 808.4956 0.7521 0.4521
## xreg4 -83.3959 150.3839 -0.5546 0.5793
## xreg5 948.9390 4103.7520 0.2312 0.8172
## xreg6 -7626.1936 1750.3083 -4.3571 0.0000
## xreg7 -16866.0467 6009.3697 -2.8066 0.0051
## xreg8 -17913.7655 5290.8543 -3.3858 0.0007
## xreg9 -34941.8652 5732.5012 -6.0954 0.0000
## xreg10 -23690.7600 5197.1412 -4.5584 0.0000
## xreg11 -28329.3354 2883.3387 -9.8252 0.0000
## xreg12 -4109.3372 2169.2249 -1.8944 0.0584
## xreg13 -8278.0559 3451.5541 -2.3984 0.0166
Most of the additional regressors are statistically significant, which is encouraging.
# sarima forecasting
mmmf <- sarima.for(ts_train_consumption_daily, 2,0,0,2,1,0,7, n.ahead = guess_ahead, xreg = xregr, newxreg = newxregr)
# plot for forecast and actual
par(mfrow= c(2,1))
plot(mmmf$pred)
plot(ts_test_consumption_daily, type = "l")
par(mfrow= c(1,1))
# performance metrics
perf_dt(actual = test_consumption_daily, forecast = mmmf$pred)
## n mean sd FBias MAPE RMSE MAD WMAPE
## 1 14 890563.3 60458.13 0.03548704 0.0393802 11350.44 35937.37 0.04035353
Next, we try a moving-window approach to see whether better results can be reached. The auto.arima parameters are used again.
# moving_window approach: forecast one day at a time, feeding each
# prediction back into the training series before forecasting the next day
sliding_is_fun <- function(regressed = ts_train_consumption_daily,
                           train_regressor = xregr,
                           test_regressor = newxregr,
                           sliding_window = 1) {
  slicet <- nrow(test_regressor)
  pred <- c()
  for (i in 1:slicet) {
    tester_regressor <- matrix(test_regressor[i, ], nrow = 1)
    guessor <- sarima.for(regressed, 2, 0, 0, 2, 1, 0, 7,
                          n.ahead = sliding_window,
                          xreg = train_regressor, newxreg = tester_regressor)
    regressed <- c(regressed, guessor$pred)
    pred <- c(pred, guessor$pred)
    train_regressor <- rbind(train_regressor, tester_regressor)
  }
  return(pred)
}
moving_forecasted_values <- sliding_is_fun()
moving_forecasted_values
## [1] 758828.0 883726.8 899110.7 905243.5 887979.6 877708.8 826391.5 750780.0
## [9] 875859.2 891912.6 898215.7 875773.8 870378.7 823517.1
perf_dt(actual = test_consumption_daily, forecast = moving_forecasted_values)
## n mean sd FBias MAPE RMSE MAD WMAPE
## 1 14 890563.3 60458.13 0.03548795 0.03938105 11350.71 35938.15 0.0403544
The results show a good model for estimating 14 days ahead, with a WMAPE of roughly 0.04. This suggests the estimates can serve as proxy supply values that can guide suppliers.